A multi-institutional study using artificial intelligence to provide reliable and fair feedback to surgeons

Background Surgeons who receive reliable feedback on their performance quickly master the skills necessary for surgery. Such performance-based feedback can be provided by a recently-developed artificial intelligence (AI) system that assesses a surgeon’s skills based on a surgical video while simultaneously highlighting aspects of the video most pertinent to the assessment. However, it remains an open question whether these highlights, or explanations, are equally reliable for all surgeons.
Methods Here, we systematically quantify the reliability of AI-based explanations on surgical videos from three hospitals across two continents by comparing them to explanations generated by human experts. To improve the reliability of AI-based explanations, we propose the strategy of training with explanations – TWIX – which uses human explanations as supervision to explicitly teach an AI system to highlight important video frames.
Results We show that while AI-based explanations often align with human explanations, they are not equally reliable for different sub-cohorts of surgeons (e.g., novices vs. experts), a phenomenon we refer to as an explanation bias. We also show that TWIX enhances the reliability of AI-based explanations, mitigates the explanation bias, and improves the performance of AI systems across hospitals. These findings extend to a training environment where medical students can be provided with feedback today.
Conclusions Our study informs the impending implementation of AI-augmented surgical training and surgeon credentialing programs, and contributes to the safe and fair democratization of surgery.

One significant drawback from the work is that there is no plan to release neither the videos nor the annotations. This both makes the contribution exclusive to this particular paper and finding and does not enable further work (which is pretty much guaranteed to be needed) to further propel the field.
The above aside, this is excellent work, and it is much needed in this area. Yet there are some things to potentially discuss and specifically to unpick a bit more clearly in the text. For example:
-The data has a huge number of participants. Seems like the average number of procedures (real) per surgeon is about 5 though unlikely to be distributed equally. Isn't this an aspect to investigate? Like what is the effect of explainability across individuals, not across institutions?
-While the above is addressed in part by the bias across surgeons section, it feels a bit like grouping results and not giving confidence on the interpretations.
-Mixing the student and training studies with the real data studies seems confusing for me.
-Why not train on the other datasets too and validate the other way around, e.g. test on the USC data, rather than train on it. Does the same explanation stand up?
-Did you consider ablation studies?
-Technically, it does not seem surprising that injecting more information, improves the explainability. Would this consistently apply if swapping the data labelling methodology, e.g. not clipping videos, etc. How about if not using flow but using 3D networks? How about kinematics?
-I would find it helpful to have a full explanation of all the data. Either pictorial or with graphs. How many samples from each video; the effect on individual videos, etc.
Many thanks for the invitation to review this paper. The authors should be commended on presenting this paper on an important and highly necessary topic within the field of automated skill assessment.
The authors present impressive and exciting results, however my concerns centre around the dataset used and specifically how ground truth labels were determined which may call into question the reliability of the results presented.
1. Concerning skill assessment annotation, what were the background of the skill annotators (was there any clinical background?). Were any videos marked by experts? What was the justification chosen for 80% agreement as a threshold for adequate training?
2. Why was a binary classification of skill assessment chosen when the original cited assessment tool was on a 3-point Likert scale?
3. How was validation of assessors determination of important periods of video clips determined? Competency in assessment seems to be determined only by capability of assessing high vs low skill.
4. Were videos double assessed? What was the interrater variability of determination of critical periods among the cohort of assessors? Secondly, vs expert surgeons.
5. Could the authors further clarify the distribution of the human raters' perceived critical timestamps? i.e., was there significant weighting to the first x% of the video or was this relatively equally distributed?
6. I question the combination of the use of medical students performing a task in a simulated environment to surgeons performing the steps in live surgery? Given the fact that the authors ultimately chose to focus on low skill participants, this will have constituted a significant proportion of the final dataset. In my opinion, limited focus is given to the justification of this crucial methodological decision.
7. More could be emphasised within the discussion around future clinical and training implications of this technology -should future aims be to extend beyond highlighting critical video frames to e.g., providing narrative feedback
Reviewer #3 (Remarks to the Author):
In this project the authors address the very important task of using AI to assess surgical skill. To this end they collected data from multiple sites, developed a method for ranking performance and developed AI tools to identify skill. I believe that in general this is a very important project that may have a significant impact on the training and assessing of surgical skill.
In this specific study the authors evaluate the reliability of explanations of their AI algorithms, which are important tasks. My first main concern with this manuscript is that I found it hard to follow, and it took me several times until I understood their main contribution. I think the main issue is with the introduction which does not lay the foundation to what is done in the manuscript and what has been presented in other manuscripts. If I understood correctly the authors define "explanation" as the ability to show which part of the data is the most important to reach the conclusion. For example if the video is 60 seconds long, highlight the 10 most important seconds. It might be my personal bias, but when I hear the word explanation, I think of more specific explanations such as "you are not holding the needle correctly". Nevertheless, the authors should be very explicit regarding their definition of explanation in this context. In essence, the authors compare the ability of general attention model to identify the most important part of the video clip and compare it to a model that is provided with explicit labels regarding the important parts of the video, this was very hard to understand. In addition, the authors mention SAIS and TWIX. However, they do not mention their source in the introduction. Only in the methods section it was made clear that SAIS was developed by the authors and presented in reference [11]. It isn't clear to me where TWIX is described properly. The authors show that TWIX provides better explanation. However, this is not surprising since it receives the explicit labels. I think that from an algorithmic point of view, perhaps the fact that SAIS was able to achieve partial explanation is more impressive, since it is an unsupervised task which achieves nice results. I believe the introduction should be revised. It should be clear what was done in previous studies (by the authors) and what is new in this study.
In addition, it should include better definitions.
On the other side, the rest of the paper is a bit long and, if possible, I would recommend shortening it; I think the authors repeat sentences. I think the paper should be re-written, the introduction should provide a better description of where we are heading and the rest should be shorter. Some smaller comments: I think they might be able to combine Figure 3 & 4 to one figure. I find it very surprising that in figure 2, USC has lower results considering the fact the model was trained using USC.
We would like to thank the reviewers for taking the time and effort to review our manuscript and for providing us with valuable feedback. We address your comments below.
We would also like to mention that our previous study, in which we develop SAIS (the AI system underpinning this current study), has since been accepted at Nature Biomedical Engineering.

Reviewer 1 Summary
This paper is focused on the use of AI models on surgical video with the purpose of replicating human assessment of surgical skill. The main methodology used in the paper is based on prior work, either from the ML/CV community or from studies from the authors' groups. This is fine as this is a detailed analysis of the application to a multi-centre dataset. There is solid rigour and thinking behind the analysis and the explanations/narrative in the work.

R1 -Comment 1
One significant drawback from the work is that there is no plan to release neither the videos nor the annotations. This both makes the contribution exclusive to this particular paper and finding and does not enable further work (which is pretty much guaranteed to be needed) to further propel the field.

Response to R1 -Comment 1
We had outlined in the Data availability statement (from the first manuscript submission) that we plan to release both the raw videos and the annotations for the data from the training environment (with medical students). To facilitate the reproducibility of our findings and propel the field forward, we also plan to share the real surgical videos from USC and their corresponding annotations with researchers on a case-by-case basis. The Data availability statement (page 13) has been updated to reflect this.

R1 -Comment 2
The above aside, this is excellent work, and it is much needed in this area. Yet there are some things to potentially discuss and specifically to unpick a bit more clearly in the text. For example: The data has a huge number of participants. Seems like the average number of procedures (real) per surgeon is about 5 though unlikely to be distributed equally. Isn't this an aspect to investigate? Like what is the effect of explainability across individuals, not across institutions?
While the above is addressed in part by the bias across surgeons section, it feels a bit like grouping results and not giving confidence on the interpretations.

Response to R1 -Comment 2
Two of the goals of our study were to (1) quantify the reliability of explanations generated by surgical AI systems and (2) measure the potential discrepancy (bias) in the reliability of explanations across surgeon sub-cohorts (e.g., novices vs. experts).
As with almost any AI system, it is always possible to stratify its performance at the level of an individual (e.g., surgeon). Although it is also possible to stratify the reliability of explanations at the surgeon level, we believe there is greater value, at least in the current scope of our study, to quantify the reliability of explanations at a more aggregated level (e.g., at the hospital level). This is because it allows us to examine whether our findings generalize across hospitals, which is often viewed as a rigorous approach to evaluating AI systems and methodologies such as TWIX. It signals to readers that TWIX can indeed learn from human supervision and generalize to held-out datasets, and thus increase its likelihood of adoption by future researchers.
As for the second goal, examining bias at the group level is a common choice made by researchers in the field who investigate algorithmic bias. In this context, and from a practical standpoint, we focus on groups of surgeons, as opposed to individual surgeons, due to the relatively larger number of samples in each group, thereby lending greater confidence to our findings.
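To make the group-level comparison concrete, the following is a minimal sketch of how a discrepancy in explanation reliability across sub-cohorts could be computed. It assumes that a per-sample reliability score (a hypothetical number between 0 and 1) has already been obtained for each video sample; the function names, the score, and the choice of "largest minus smallest group mean" as the summary are illustrative assumptions, not the measure used in the manuscript.

```python
from collections import defaultdict
from statistics import mean


def reliability_by_group(scores, groups):
    """Average a per-sample reliability score within each sub-cohort.

    scores: hypothetical per-sample reliability values (assumed in [0, 1]).
    groups: sub-cohort label for each sample (e.g., "novice", "expert").
    """
    if len(scores) != len(groups):
        raise ValueError("scores and groups must have the same length")
    grouped = defaultdict(list)
    for score, group in zip(scores, groups):
        grouped[group].append(score)
    return {group: mean(values) for group, values in grouped.items()}


def explanation_bias(scores, groups):
    """Gap between the best- and worst-served sub-cohort (an assumed summary)."""
    group_means = reliability_by_group(scores, groups)
    return max(group_means.values()) - min(group_means.values())


# Toy data: two samples per sub-cohort.
scores = [0.8, 0.6, 0.4, 0.6]
groups = ["expert", "expert", "novice", "novice"]
print(reliability_by_group(scores, groups))
print(explanation_bias(scores, groups))
```

A gap close to zero would indicate that explanations are about equally reliable for the groups compared; pooling samples within each group is what gives every group mean a larger sample size than a per-surgeon estimate would have.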

R1 -Comment 3
Mixing the student and training studies with the real data studies seems confusing for me.

Response to R1 -Comment 3
SAIS was originally developed to assess the skills of surgeons based on videos of real robotic surgeries.
In this study, we demonstrated how SAIS and its explanations have the potential to be used for the provision of surgeon feedback. It is very likely that SAIS will be used, in the short run, to assess the skills of surgical trainees and provide them with feedback on their performance. The imminent use of SAIS for such an application motivated our inclusion of the results from the training environment. It is equally important to ensure that surgical trainees, particularly those upstream to practicing surgeons, are not disadvantaged by AI skill assessment systems. We have included this motivation in the section Results → Providing feedback today in training environment (page 6).

R1 -Comment 4
COMMSMED-22-0371-T Point-by-point response
Why not train on the other datasets too and validate the other way around, e.g. test on the USC data, rather than train on it. Does the same explanation stand up?

Response to R1 -Comment 4
SAIS was trained exclusively on data from USC and deployed on held-out datasets from USC, St. Antonius Hospital, and Houston Methodist Hospital. This decision was made primarily because of the larger number of samples from USC relative to the other hospitals (see Table 2 for exact number of video samples). By training on data from USC, SAIS was demonstrated to achieve strong generalization performance, an important prerequisite for evaluating the reliability of AI-based explanations. In other words, quantifying the reliability of AI-based explanations is almost moot if the underlying AI system generalizes poorly.
We do, however, appreciate the reviewer's comment about whether the "same explanations stand up". We interpret this statement as broadly referring to whether AI-based explanations are stable or robust to changes in the experimental setup (e.g., different training data, different learning protocols, etc.). To that end, we take the reviewer's suggestion from Comment 5 (next comment) and conduct two ablation studies where we (1) withhold the optical flow data modality when training SAIS and (2) train a multi-class skill assessment variant of SAIS, and quantify the reliability of explanations and the explanation bias in these settings (see Results → Ablation study, page 6, paragraph 1, and Figure 5, page 6). In short, we find that TWIX consistently improves the reliability of explanations and mitigates the explanation bias irrespective of the experimental setting in which it is deployed.

R1 -Comment 5
Did you consider ablation studies?

Response to R1 -Comment 5
Please see Response to R1 -Comment 4

R1 -Comment 6
Technically, it does not seem surprising that injecting more information, improves the explainability. Would this consistently apply if swapping the data labelling methodology, e.g. not clipping videos, etc. How about if not using flow but using 3D networks? How about kinematics?

Response to R1 -Comment 6
Although supervising the TWIX module with human explanations was expected to improve the reliability of AI-based explanations, this was not guaranteed to occur. Specifically, an AI system that is presented with supervised ground-truth labels must learn from such labels such that it is able to generalize to unseen samples. Our contribution is that we demonstrated that TWIX can indeed learn from human explanations and generalize across videos from three geographically-diverse hospitals.
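As a rough illustration of what supervising frame importance with human explanations can look like, the sketch below computes a mean binary cross-entropy between per-frame importance probabilities and binary human labels. This is an assumption made for illustration only: the loss form, the function name, and the inputs are hypothetical and are not taken from the TWIX implementation.

```python
import math


def frame_supervision_loss(predicted_importance, human_mask, eps=1e-7):
    """Mean binary cross-entropy over the frames of one video sample.

    predicted_importance: assumed per-frame probabilities in [0, 1].
    human_mask: human explanation labels, 1 (relevant) or 0 (irrelevant).
    """
    if len(predicted_importance) != len(human_mask) or not human_mask:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, y in zip(predicted_importance, human_mask):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(human_mask)


human_mask = [0, 1, 1, 0]
aligned = frame_supervision_loss([0.1, 0.9, 0.8, 0.2], human_mask)
misaligned = frame_supervision_loss([0.9, 0.1, 0.2, 0.8], human_mask)
print(aligned < misaligned)  # predictions that match the human labels score lower
```

A low loss on training samples does not by itself imply that the highlighted frames will match human explanations on unseen videos, which is why generalization to held-out hospitals has to be measured separately.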
As for quantifying the reliability of explanations under different experimental settings and variants of SAIS, we conducted a set of ablation studies that are described in Response to R1 -Comment 4 (see Results → Ablation study, page 6, paragraph 1, and Figure 5, page 6). In short, we find that TWIX consistently improves the reliability of explanations and mitigates the explanation bias irrespective of the experimental setting in which it is deployed.
In our original paper, in which we introduced the SAIS system, we demonstrated that SAIS outperforms the state-of-the-art 3D convolutional networks (Inception3D or I3D) on a multitude of tasks including surgeon skill assessment. Please note that these results are in the latest version of our original manuscript (not on arXiv) which has since been accepted at Nature Biomedical Engineering. In light of SAIS' improved performance in assessing surgeon skills relative to I3D, we do not experiment with I3D (for which obtaining frame-level explanations is non-trivial because of the way it processes volumes of frames). As for incorporating additional data modalities (e.g., kinematics), SAIS is a modular architecture that can accept (and ultimately aggregate) any number of input modalities. If kinematics data are available, which we do not currently have access to, then they can seamlessly be incorporated into the learning process.

R1 -Comment 7
I would find it helpful to have a full explanation of all the data. Either pictorial or with graphs. How many samples from each video; the effect on individual videos, etc.

Response to R1 -Comment 7
In the current manuscript, we had outlined the total number of videos and samples from each hospital and for each skill (needle handling and needle driving) (see Table 2, page 9). Supplementary Note 1 also outlines the number of samples in each surgeon sub-cohort, which are used for the experiments in which we stratify the reliability of explanations across sub-cohorts. A more complete description of the data can be found in the latest version of our original manuscript, which has since been accepted at Nature Biomedical Engineering.

Reviewer 2 Summary
Many thanks for the invitation to review this paper. The authors should be commended on presenting this paper on an important and highly necessary topic within the field of automated skill assessment. The authors present impressive and exciting results, however my concerns centre around the dataset used and specifically how ground truth labels were determined which may call into question the reliability of the results presented.

R2 -Comment 1
Concerning skill assessment annotation, what were the background of the skill annotators (was there any clinical background?). Were any videos marked by experts? What was the justification chosen for 80% agreement as a threshold for adequate training?

Response to R2 -Comment 1
To clarify, the skill assessment annotations used in this manuscript were obtained from, and are exactly the same as those used in, the original study describing the development and validation of a surgical AI system for decoding the elements of surgery. That study has since been accepted at Nature Biomedical Engineering.
In the original study, we assembled a team of trained human raters to annotate video samples with skill assessments based on a previously-developed skill assessment taxonomy (also known as an end-to-end assessment of suturing expertise or EASE). EASE was formulated through a rigorous Delphi process which involved five expert surgeons that identified a strict set of criteria for assessing multiple skills related to suturing (e.g., needle handling, needle driving, etc.). Our team of raters comprised medical students and surgical residents who either helped devise the original skill assessment taxonomy themselves or had been intimately aware of the details of the taxonomy.
While video samples were not assessed by attending surgeons, we believe the degree of annotation noise is limited for the following reasons. First, EASE outlines a strict set of criteria related to the visual and motion content reflected in a video sample, thereby making it straightforward to identify whether such criteria are satisfied (or violated) upon watching a video sample. This reduces the level of expertise that a rater must ordinarily have in order to annotate a video sample. Second, the raters involved in the annotation process were either a part of the development of the EASE taxonomy or intimately aware of its details. This implied that they were comfortable with the criteria outlined in EASE. Third, and understanding that raters can be imperfect, we subjected them to a training process whereby raters were provided with a training set of video samples and asked to annotate them independently of one another. This process continued until the agreement of their annotations, which was quantified via inter-rater reliability, exceeded 80%. We chose this threshold based on (a) the level of agreement first reported in the study developing the EASE taxonomy and (b) an appreciation that natural variability is likely to exist from one rater to the next in, for example, the amount of attention they place on certain content within a video sample (Methods → Surgical video samples and annotations → Skill assessment annotations, page 9 -10).

R2 -Comment 2
Why was a binary classification of skill assessment chosen when the original cited assessment tool was on a 3-point Likert scale?

Response to R2 -Comment 2
While EASE (the skill assessment taxonomy) does outline a set of criteria for classifying skill into three distinct categories (low vs. intermediate vs. high), the family of surgical AI systems (SAIS) which we leverage throughout this study was developed to perform binary skill assessment (low vs. high skill).
That decision was originally made for practical reasons, where we had an insufficient number of video samples annotated as intermediate skill to warrant their inclusion in the learning process of the AI system. We therefore opted to leverage the video samples annotated as low or high skill to develop a binary skill assessment system.
In this study, our use of a binary skill assessment system fits well with our goal of providing feedback for video samples that were annotated as depicting low skill activity. The motivation behind our focus on low skill activity is twofold. First, from a practical standpoint, it is relatively more straightforward to provide an explanation annotation for a video sample depicting low skill activity than it is for one depicting high skill activity. This is because human raters simply have to look for segments in the video sample during which one (or more) of the criteria outlined in EASE are violated. Second, from an educational standpoint, studies in the domain of educational psychology have demonstrated that corrective feedback following an error is instrumental to learning [1]. As such, our focus on a low skill activity (akin to an error) provides a ripe opportunity for the provision of feedback. We do appreciate, however, that feedback can also be useful when provided for video samples depicting high skill activity (e.g., through positive reinforcement). We leave this as an extension of our work for the future (Methods → Motivation behind focusing on low-skill activity, page 11, paragraph 1).
Having motivated our use of a binary skill assessment system, we also train SAIS to perform multiclass skill assessment (low vs. intermediate vs. high) for the skill of needle handling. In Results → Ablation study (page 6, paragraph 1) and Figure 5 (page 6), we present the reliability of SAIS' explanations in this setting and its explanation bias, before and after using TWIX. In short, we demonstrate that TWIX continues to improve the reliability of explanations and mitigate the explanation bias irrespective of the experimental setting in which it is deployed.

R2 -Comment 3
How was validation of assessors determination of important periods of video clips determined? Competency in assessment seems to be determined only by capability of assessing high vs low skill.

Response to R2 -Comment 3
We assembled a team of two trained human raters to annotate each video sample with segments of time (or equivalently, spans of frames) deemed relevant for a particular skill assessment. We define segments of time as relevant if they reflect the strict set of criteria (or their violation) outlined in the skill assessment taxonomy. In practice, we asked raters to exclusively annotate video samples previously tagged as low skill from a previous study. Our motivation for doing so is outlined in the Methods section. For the activity of needle handling, a low skill assessment is characterized by three or more grasps of the needle by the surgical instrument. For the activity of needle driving, a low skill assessment is characterized by either four or more adjustments of the needle when being driven through tissue or its complete removal from tissue in the opposite direction to which it was inserted. As such, raters had to identify both visual and motion cues in the surgical field of view in order to annotate segments of time as relevant (Methods → Surgical video samples and annotations → Skill explanation annotations, page 9 -10).
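The two low-skill criteria stated above can be written as simple rules. The sketch below is only an illustration: the function names are hypothetical, and it assumes the counts (grasps, adjustments) and the removal flag have already been determined by a rater watching the video sample.

```python
def is_low_skill_needle_handling(num_grasps: int) -> bool:
    """Low skill: three or more grasps of the needle by the instrument."""
    return num_grasps >= 3


def is_low_skill_needle_driving(num_adjustments: int,
                                removed_opposite_direction: bool) -> bool:
    """Low skill: four or more adjustments of the needle while it is driven
    through tissue, or complete removal of the needle from tissue in the
    direction opposite to which it was inserted."""
    return num_adjustments >= 4 or removed_opposite_direction


print(is_low_skill_needle_handling(3))        # True
print(is_low_skill_needle_driving(2, False))  # False
print(is_low_skill_needle_driving(1, True))   # True
```

Under these rules, the segments of time a rater marks as relevant are those in which the counted events (the grasps, the adjustments, or the removal) actually occur.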
Before providing such explanation annotations, however, the raters underwent a training process akin to the one conducted for skill assessment annotations. First, raters were familiarized with the criteria outlined in the skill assessment taxonomy. In practice, and to mitigate potential noise in the explanation annotations, our assembled team of raters had, in the past, already been involved in providing skill assessment annotations while using the same exact taxonomy. The raters were then provided with a training set of low-skill video samples and asked to independently annotate them with segments of time that they believed were important to that skill assessment. During this time, raters were encouraged to abide by the strict set of criteria outlined in the skill assessment taxonomy. This training process continued until the agreement in their annotations, which was quantified via the intersection over union, exceeded 0.80. This implies that, on average, each segment of time highlighted by one rater exhibited an 80% overlap with that provided by another rater. This value was chosen, as with the skill assessment annotation process, having appreciated that natural variability in the annotation process is likely to occur. Raters may disagree, for example, on when an important segment of time ends even when both of their explanation annotations capture the bulk of the relevant activity (Methods → Surgical video samples and annotations → Skill explanation annotations → Training the raters, page 10, paragraph 4).
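For readers unfamiliar with the metric, the following minimal sketch shows how the intersection over union between two raters' explanation annotations could be computed for one video sample. It assumes each annotation is a list of (start, end) frame spans with an exclusive end; this representation and the function names are assumptions made for illustration.

```python
def segments_to_frames(segments):
    """Expand (start, end) frame spans, end exclusive, into a set of frame indices."""
    frames = set()
    for start, end in segments:
        frames.update(range(start, end))
    return frames


def intersection_over_union(segments_a, segments_b):
    """Frames marked by both raters divided by frames marked by either rater."""
    frames_a = segments_to_frames(segments_a)
    frames_b = segments_to_frames(segments_b)
    union = frames_a | frames_b
    if not union:
        return 1.0  # neither rater marked anything: treated here as full agreement
    return len(frames_a & frames_b) / len(union)


rater_1 = [(10, 30)]  # frames 10..29
rater_2 = [(12, 30)]  # frames 12..29, i.e. the raters disagree on the start
iou = intersection_over_union(rater_1, rater_2)
print(iou)         # 18 shared frames / 20 frames in total = 0.9
print(iou > 0.80)  # True: this pair would pass the training threshold
```

In this toy case the raters disagree only on where the segment begins, which lowers the score without bringing it below the 0.80 threshold.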
Upon completing the training process, raters were asked to provide explanation annotations for the video samples used in this study. They were informed that each video sample had been annotated in the past as low skill, and were therefore aware of the specific criteria in the taxonomy to look out for. In the event of disagreements in the explanation annotations, we considered the intersection of the annotations. This ensures that we avoid identifying potentially superfluous video frames as relevant and makes us more confident in the segments of time that overlapped amongst the raters' annotations. Although we experimented with other strategies for aggregating the explanation annotations, such as considering their union, we found this to have a minimal effect on our findings (Methods → Surgical video samples and annotations → Skill explanation annotations → Aggregating explanation annotations, page 10, paragraph 4).
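The two aggregation strategies mentioned above can be sketched as follows, assuming each rater's annotation is represented as a per-frame binary mask of equal length (1 for a relevant frame, 0 otherwise). The function name and the mask representation are illustrative assumptions.

```python
def aggregate_annotations(mask_a, mask_b, strategy="intersection"):
    """Combine two raters' per-frame binary masks into one annotation.

    "intersection" keeps only frames both raters marked as relevant;
    "union" keeps frames marked by at least one rater.
    """
    if len(mask_a) != len(mask_b):
        raise ValueError("masks must cover the same number of frames")
    if strategy == "intersection":
        return [a & b for a, b in zip(mask_a, mask_b)]
    if strategy == "union":
        return [a | b for a, b in zip(mask_a, mask_b)]
    raise ValueError(f"unknown strategy: {strategy}")


rater_1 = [0, 1, 1, 1, 0]
rater_2 = [0, 0, 1, 1, 1]
print(aggregate_annotations(rater_1, rater_2))           # [0, 0, 1, 1, 0]
print(aggregate_annotations(rater_1, rater_2, "union"))  # [0, 1, 1, 1, 1]
```

The intersection is the more conservative choice: a frame is treated as relevant only when both raters agree on it.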

R2 -Comment 4
Were videos double assessed? What was the interrater variability of determination of critical periods among the cohort of assessors? Secondly, vs expert surgeons.

Response to R2 -Comment 4
Yes, please see our Response to R2 -Comment 3.

R2 -Comment 5
Could the authors further clarify the distribution of the human raters' perceived critical timestamps? i.e., was there significant weighting to the first x% of the video or was this relatively equally distributed?

Response to R2 -Comment 5
To give readers a better appreciation of the ground-truth explanation annotations, we incorporate the reviewer's suggestion into our manuscript by presenting a heatmap of the explanations over time at the distinct hospitals and for the two skills (needle handling and needle driving). These heatmaps are shown in Figure 7 (Methods → Surgical video samples and annotations → Skill explanation annotations → generating and visualising explanation heatmaps, page 10, paragraph 1).
To generate these heatmaps, we considered unique video samples in the test set of each Monte Carlo fold (10 folds in total). Since each video sample may vary in duration, and to facilitate a comparison of the heatmaps across hospitals, we first normalized the time index of each explanation annotation such that it ranged from 0 (beginning of video sample) to 1 (end of video sample). In the context of needle handling, for example, this translates to the beginning and end of needle handling, respectively. As another example, a value of 0.20 refers to the first 20% of the video sample. We then averaged the explanation annotations, whose values are either 0 (irrelevant frame) or 1 (relevant frame), across the video samples for this normalized time index. We repeated the process for all hospitals and skills (needle handling and needle driving).
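The heatmap procedure described above can be sketched as follows. The sketch assumes each explanation annotation is a per-frame binary mask and resamples every mask onto a fixed number of normalized time bins by looking up the frame nearest to each bin centre; the number of bins, the resampling rule, and the function names are assumptions for illustration and may differ from what was used to produce Figure 7.

```python
def resample_mask(mask, num_bins):
    """Map a variable-length binary mask onto num_bins normalized time bins.

    Bin b covers normalized time [b / num_bins, (b + 1) / num_bins); its value
    is taken from the frame located at the bin centre.
    """
    n = len(mask)
    return [mask[min(int((b + 0.5) / num_bins * n), n - 1)]
            for b in range(num_bins)]


def explanation_heatmap(masks, num_bins=10):
    """Average resampled masks across video samples for each normalized bin."""
    resampled = [resample_mask(m, num_bins) for m in masks if m]
    if not resampled:
        return [0.0] * num_bins
    return [sum(column) / len(resampled) for column in zip(*resampled)]


# Two toy video samples of different duration (4 and 8 frames).
masks = [
    [1, 1, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0],
]
print(explanation_heatmap(masks, num_bins=4))  # [1.0, 0.5, 0.0, 0.0]
```

Each value is the fraction of video samples whose annotation marks that portion of normalized time as relevant, so values near 1 early on would indicate that raters tend to highlight the beginning of the activity.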

R2 -Comment 6
I question the combination of the use of medical students performing a task in a simulated environment to surgeons performing the steps in live surgery? Given the fact that the authors ultimately chose to focus on low skill participants, this will have constituted a significant proportion of the final dataset. In my opinion, limited focus is given to the justification of this crucial methodological decision.

Response to R2 -Comment 6
To clarify, we made the decision to focus on explanations associated with low-skill activity, and not necessarily low-skill participants. The distinction here is that experienced surgeons can still exhibit low-skill activity, according to our previously-developed skill assessment taxonomy. Conversely, medical students and surgical trainees can exhibit high-skill activity. Therefore, we believe that all individuals, irrespective of their experience, can benefit from surgical training and feedback.
We had provided a motivation for exclusively focusing on low-skill activity in the section Methods → Motivation behind focusing on low skill activity (page 11). As we mentioned in Response to R2 -Comment 2, we do appreciate that feedback can also be useful when provided for video samples depicting high skill activity (e.g., through positive reinforcement). We leave this as an extension of our work for the future.

R2 -Comment 7
More could be emphasised within the discussion around future clinical and training implications of this technology -should future aims be to extend beyond highlighting critical video frames to e.g., providing narrative feedback

Response to R2 -Comment 7
We have now expanded our Discussion section (page 7, paragraph 3) to outline, in more depth, the clinical and training implications of our framework and of depending on AI-based explanations.

Reviewer 3 Summary
In this project the authors address the very important task of using AI to assess surgical skill. To this end they collected data from multiple sites, developed a method for ranking performance and developed AI tools to identify skill. I believe that in general this is a very important project that may have a significant impact on the training and assessing of surgical skill.

R3 -Comment 1
In this specific study the authors evaluate the reliability of explanations of their AI algorithms, which are important tasks. My first main concern with this manuscript is that I found it hard to follow, and it took me several times until I understood their main contribution. I think the main issue is with the introduction which does not lay the foundation to what is done in the manuscript and what has been presented in other manuscripts.

Response to R3 -Comment 1
To improve the clarity of the Introduction (page 1), we make the following changes:
• Paragraph 1 -we clearly introduce SAIS (our previously-developed AI system). This should make it clear that SAIS has already been developed and, in this study, we are experimenting with and building upon it.
• Paragraph 1 -we include our definition of "explanations" (which the reviewer had correctly understood as highlighting the most important frames in a video).